Abstract

Machine learning is widely used to analyze biological sequence data. Non-sequential models such as SVMs
or feed-forward neural networks are often used although they have no natural way of handling sequences of
varying length. Recurrent neural networks such as the long short-term memory (LSTM) model, on the other hand,
are designed to handle sequences. In this study we demonstrate that LSTM networks predict the subcellular
location of proteins given only the protein sequence with high accuracy (0.902), outperforming current
state-of-the-art algorithms. We further improve the performance by introducing convolutional filters and experiment with
an attention mechanism which lets the LSTM focus on specific parts of the protein. Lastly, we introduce new
visualizations of both the convolutional filters and the attention mechanism and show how they can be used to
extract biologically relevant knowledge from the LSTM networks.


1. INTRODUCTION 


Deep neural networks have gained popularity for a wide range of classification tasks in image recognition and speech
tagging (Dahl et al., 2012; Krizhevsky et al., 2012) and recently also within biology for prediction of exon skipping events
(Xiong et al., 2014). Furthermore, a surge of interest in recurrent neural networks (RNN) has followed the recent
impressive results shown on challenging sequential problems like machine translation and speech recognition
(Bahdanau et al., 2014; Graves & Jaitly, 2014; Sutskever et al., 2014). Within biology, sequence analysis is a very
common task used for prediction of features in protein or nucleic acid sequences. Current methods generally rely on
neural networks and support vector machines (SVM), which have no natural way of handling sequences of varying length.
Furthermore, these systems rely on highly hand-engineered input features requiring a high degree of domain knowledge
when designing the algorithms (Emanuelsson et al., 2007; Petersen et al., 2011). This paper uses the long short-term
memory network (LSTM) (Hochreiter et al., 1997) to analyze biological sequences and predict to which subcellular
compartment a protein belongs. This prediction task, known as protein sorting or subcellular localization, has attracted
large interest in the bioinformatics field (Emanuelsson et al., 2007). We show that an LSTM network, using only the
protein sequence information, has significantly better performance than current state-of-the-art SVMs and furthermore
has nearly as good performance as large hand-engineered systems relying on extensive metadata such as GO terms and
evolutionary phylogeny, see Figure 4 (Blum et al., 2009; Briesemeister et al., 2009; Höglund et al., 2006). These results
show that LSTM networks are efficient algorithms that can be trained even on relatively small datasets of around 6000
protein sequences. Secondly, we investigate how an LSTM network recognizes the sequence. In image recognition,
convolutional neural networks (CNN) have shown state-of-the-art performance in several different tasks
(LeCun et al., 1990; Krizhevsky et al., 2012). Here the lower layers of a CNN can often be interpreted as feature
detectors recognizing simple geometric entities, see Figure 1. We develop a


Convolutional LSTM Networks for Subcellular Localization of Proteins 


simple visualization technique for convolutional filters trained on either DNA or amino acid sequences and show that in
the biological setting filters can be interpreted as motif detectors, as visualized in Figure 1. Thirdly, inspired by the work of



Figure 1. Left: First layer convolutional filters learned in (Krizhevsky et al., 2012). Note that many filters are edge
detectors or color detectors. Right: Example of a learned filter on amino acid sequence data. Note that this filter is
sensitive to positively charged amino acids.


Bahdanau et al., we augment the LSTM network with an attention mechanism that learns to assign importance to specific
parts of the protein sequence. Using the attention mechanism we can visualize where the LSTM assigns importance, and
we show that the network focuses on regions that are biologically plausible. Lastly, we show that the LSTM network learns
a fixed length representation of amino acid sequences that, when visualized, separates the sequences into clusters with
biological meaning. The contributions of this paper are:


1. We show that LSTM networks combined with convolutions are efficient for predicting subcellular localization of 
proteins from sequence. 

2. We show that convolutional filters can be used for amino acid sequence analysis and introduce a visualization
technique.

3. We investigate an attention mechanism that lets us visualize where the LSTM network focuses. 


4. We show that the LSTM network effectively extracts a fixed length representation of variable length proteins. 


2. MATERIALS AND METHODS 

2.1. MODEL 

This section introduces the LSTM cell and then explains how a regular LSTM (R-LSTM) can produce a single output.
We then introduce the LSTM with attention mechanism (A-LSTM) and describe how the attention mechanism is
implemented.


2.1.1. LSTM NETWORK 


The LSTM cell is implemented as described in (Graves, 2013) except for peepholes, because recent papers have shown
good performance without them (Sutskever et al., 2014; Zaremba & Sutskever, 2014; Zaremba et al., 2014b). Figure 2 shows
the LSTM cell. Equations (1)-(10) state the forward recursions for a single LSTM layer.


\begin{align}
i_t &= \sigma(D(x_t) W_{xi} + h_{t-1} W_{hi} + b_i) \tag{1} \\
f_t &= \sigma(D(x_t) W_{xf} + h_{t-1} W_{hf} + b_f) \tag{2} \\
g_t &= \tanh(D(x_t) W_{xg} + h_{t-1} W_{hg} + b_g) \tag{3} \\
c_t &= f_t \odot c_{t-1} + i_t \odot g_t \tag{4} \\
o_t &= \sigma(D(x_t) W_{xo} + h_{t-1} W_{ho} + b_o) \tag{5} \\
h_t &= o_t \odot \tanh(c_t) \tag{6} \\
\sigma(z) &= \frac{1}{1 + \exp(-z)} \tag{7} \\
\odot &: \text{elementwise multiplication} \tag{8} \\
D &: \text{dropout, set values to zero with probability } p \tag{9} \\
x_t &: \text{input from the previous layer, } h_t^{l-1} \tag{10}
\end{align}


Here all quantities are given as row vectors, and activation and dropout functions are applied element-wise. If dropout is
used it is only applied to non-recurrent connections in the LSTM cell (Zaremba et al., 2014a). In a multilayer LSTM, h_t is
passed upwards to the next layer.
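The forward recursions in equations (1)-(6) can be sketched in NumPy as follows. This is a minimal illustration, not the Theano/Lasagne code used in the paper. The helper names (`lstm_step`, `init_params`), the dictionary layout of the weights, and the choice to share one dropout mask across the four gates are assumptions made for the example.

```python
import numpy as np

def sigmoid(z):
    # Eq. (7)
    return 1.0 / (1.0 + np.exp(-z))

def dropout(x, p, rng, train):
    # Eq. (9): zero each entry with probability p (no rescaling in this sketch).
    if not train or p == 0.0:
        return x
    return x * (rng.random(x.shape) >= p)

def init_params(n_in, n_hid, rng):
    # Hypothetical parameter layout: W["xi"], W["hi"], ... and b["i"], ...
    W = {src + gate: rng.uniform(-0.05, 0.05,
                                 size=(n_in if src == "x" else n_hid, n_hid))
         for src in "xh" for gate in "ifgo"}
    b = {gate: np.zeros(n_hid) for gate in "ifgo"}
    return {"W": W, "b": b}

def lstm_step(x_t, h_prev, c_prev, params, p=0.0, rng=None, train=False):
    """One LSTM step on row vectors; dropout touches only the input x_t."""
    if rng is None:
        rng = np.random.default_rng(0)
    W, b = params["W"], params["b"]
    dx = dropout(x_t, p, rng, train)  # assumption: one mask shared by all gates
    i = sigmoid(dx @ W["xi"] + h_prev @ W["hi"] + b["i"])   # Eq. (1)
    f = sigmoid(dx @ W["xf"] + h_prev @ W["hf"] + b["f"])   # Eq. (2)
    g = np.tanh(dx @ W["xg"] + h_prev @ W["hg"] + b["g"])   # Eq. (3)
    c = f * c_prev + i * g                                  # Eq. (4)
    o = sigmoid(dx @ W["xo"] + h_prev @ W["ho"] + b["o"])   # Eq. (5)
    h = o * np.tanh(c)                                      # Eq. (6)
    return h, c
```

Because the recurrent term `h_prev` never passes through `dropout`, the sketch applies dropout only to non-recurrent connections, in line with the description above.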



Figure 2. LSTM memory cell. i: input gate, f: forget gate, o:
output gate, g: input modulation gate, c: memory cell. The
blue arrow heads refer to c_{t-1}. The notation corresponds
to equations (1) to (10) such that W_xo denotes weights for x to
output gate and W_hf denotes weights for h_{t-1} to forget gates
etc. Adapted from (Zaremba & Sutskever, 2014).



Figure 3. A-LSTM network. Each state of the hidden units,
h_t, is weighted and summed before the output network
calculates the predictions.


2.2. REGULAR LSTM NETWORKS FOR PREDICTING SINGLE TARGETS


When used for predicting a single target for each input sequence, one approach is to output the predicted target from the
LSTM network at the last sequence position as shown in Figure 4C. A problem with this approach is that the gradient has to
flow from the last position to all previous positions and that the LSTM network has to store information about all previously
seen data in the last hidden state. Furthermore, a regular bidirectional LSTM (BLSTM) (Schuster & Paliwal, 1997) is not
useful in this setting because the backward LSTM will only have seen a single position, x_T, when the prediction has to be
made. We instead combine two unidirectional LSTMs, as shown in Figure 4, where the backward LSTM has the input
reversed. The outputs from the two LSTMs are combined before the prediction is made.
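The idea of running one network over the sequence and a second network over the reversed sequence can be sketched as below. To keep the example short, a plain tanh recurrent layer stands in for the LSTM, and the two final hidden states are concatenated; the text does not state how the two outputs are combined, so concatenation is an assumption, as are the function names.

```python
import numpy as np

def run_rnn(xs, W_x, W_h, b):
    """Run a plain tanh recurrent layer (a stand-in for an LSTM) over
    xs of shape (T, n_in) and return the hidden state at the last step."""
    h = np.zeros(W_h.shape[0])
    for x_t in xs:
        h = np.tanh(x_t @ W_x + h @ W_h + b)
    return h

def two_direction_summary(xs, fwd_params, bwd_params):
    """Summarize a sequence with two unidirectional networks.

    The forward network reads x_1..x_T; the backward network reads the
    reversed input x_T..x_1, so both have seen the whole sequence when
    their last hidden state is read out."""
    h_fwd = run_rnn(xs, *fwd_params)
    h_bwd = run_rnn(xs[::-1], *bwd_params)
    return np.concatenate([h_fwd, h_bwd])  # assumption: combine by concatenation
```

The returned vector has a fixed length regardless of T and could be fed to a dense layer followed by a softmax.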
Figure 4. A: Schematic indicating how MultiLoc combines predictions from several sources to make predictions whereas the LSTM
networks only rely on the sequence (Höglund et al., 2006). B: Unrolled single layer BLSTM. The forward LSTM (red arrows) starts
at time 1 and the backward LSTM (blue arrows) starts at time T, then they go forwards and backwards respectively. The errors from
the forward and backward nets are combined and a prediction is made for each sequence position. Adapted from Graves. C:
Unidirectional LSTM for predicting a single target. All targets except for the target at the last position are masked. Squares are LSTM
layers.


2.3. ATTENTION MECHANISM LSTM NETWORK 


Bahdanau et al. (Bahdanau et al., 2014) have introduced an attention mechanism for combining hidden state information
from an encoder-decoder RNN approach to machine translation. The novelty in their approach is that they use an alignment
function that for each output word finds important input words, thus aligning and translating at the same time. We modify
this alignment procedure such that only a single target is produced for each sequence. The developed attention mechanism
can be seen as assigning importance to each position in the sequence with respect to the prediction task. We use a BLSTM
to produce a hidden state at each position and then use an attention function to assign importance to each hidden state,
as illustrated in Figure 3. The weighted sum of hidden states is used as a single representation of the entire sequence.
This modification allows the BLSTM model to naturally handle tasks involving prediction of a single target per sequence.
Conceptually this corresponds to adding weighted skip connections (green arrow heads in Figure 3) between any h_t and the
output network, with the weight on each skip connection being determined by the attention function. Each hidden state h_t,
t = 1, ..., T, is used as input to a Feed Forward Neural Network (FFN) attention function:

\begin{equation}
a_t = \tanh(h_t W_a) v_a^T, \tag{11}
\end{equation}

where W_a is an attention hidden weight matrix and v_a is an attention output vector. From the attention function we form
softmax weights:

\begin{equation}
\alpha_t = \frac{\exp(a_t)}{\sum_{t'=1}^{T} \exp(a_{t'})} \tag{12}
\end{equation}

that are used to produce a context vector c as a convex combination of T hidden states:

\begin{equation}
c = \sum_{t=1}^{T} h_t \alpha_t. \tag{13}
\end{equation}


The context vector is then used as input to the classification FFN f(c). We define f as a single layer FFN with softmax
outputs.
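Equations (11)-(13) amount to a few lines of array code. The sketch below assumes the hidden states of one sequence are stacked as rows of a matrix `H`; the function name `attention_pool` and the max-subtraction for numerical stability are additions for the example and are not taken from the paper.

```python
import numpy as np

def attention_pool(H, W_a, v_a):
    """Attention pooling over hidden states.

    H   : (T, n_hid) hidden states h_1..h_T as rows
    W_a : (n_hid, n_att) attention hidden weight matrix
    v_a : (n_att,) attention output vector
    Returns the context vector c (n_hid,) and the weights alpha (T,).
    """
    a = np.tanh(H @ W_a) @ v_a      # Eq. (11): one scalar score per position
    a = a - a.max()                 # stability only; does not change the softmax
    alpha = np.exp(a) / np.exp(a).sum()   # Eq. (12)
    c = alpha @ H                   # Eq. (13): convex combination of the h_t
    return c, alpha
```

Since the weights `alpha` are non-negative and sum to one, `c` always lies in the convex hull of the hidden states, and `alpha` itself can be plotted against sequence position.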


2.4. SUBCELLULAR LOCALIZATION DATA 

The model was trained and evaluated on the dataset used to train the MultiLoc algorithm published by Höglund et al.
(Höglund et al., 2006), available at http://abi.inf.uni-tuebingen.de/Services/MultiLoc/multiloc_dataset. The dataset
contains 5959 proteins annotated to one of 11 different subcellular locations. To reduce computational time the protein
sequences were truncated to length 1000. We truncated by removing from the middle of the protein as both the N- and
C-terminal regions are known to contain sorting signals (Emanuelsson et al., 2007). Each amino acid was encoded using
1-of-K encoding, the BLOSUM80 (Henikoff & Henikoff, 1992) and HSDM (Prlić et al., 2000) substitution matrices and
sequence profiles, yielding 80 features per amino acid. Sequence profiles were created with ProfilePro using 3 blastpgp
iterations on UNIREF50 (Magrane et al., 2011).
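The truncation and the 1-of-K part of the encoding can be illustrated as follows. This is a sketch under stated assumptions: the text does not say how the retained residues are divided between the two ends, so an even split is assumed; the alphabet ordering and the all-zero encoding of non-standard residues are also choices made for the example. The BLOSUM80, HSDM and profile features are not reproduced here.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed ordering of the 20 standard residues

def truncate_middle(seq, max_len=1000):
    """Shorten seq to max_len by cutting residues out of the middle,
    keeping the N-terminal and C-terminal ends (assumed even split)."""
    if len(seq) <= max_len:
        return seq
    n_head = (max_len + 1) // 2
    n_tail = max_len - n_head
    return seq[:n_head] + seq[len(seq) - n_tail:]

def one_hot(seq):
    """1-of-K encode a protein sequence as a (len(seq), 20) matrix.
    Residues outside the 20-letter alphabet become all-zero rows."""
    index = {aa: k for k, aa in enumerate(AMINO_ACIDS)}
    X = np.zeros((len(seq), len(AMINO_ACIDS)))
    for t, aa in enumerate(seq):
        if aa in index:
            X[t, index[aa]] = 1.0
    return X
```

For example, `one_hot(truncate_middle(seq))` yields at most 1000 rows, with both termini preserved.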


2.5. VISUALIZATIONS 

Convolutional filters for images can be visualized by plotting the convolutional weights as pixel intensities as shown in
Figure 1. However, a similar approach does not make sense for amino acid inputs due to the 1-of-K vector encoding. Instead
we view the 1D convolutions as a position specific scoring matrix (PSSM). The convolutional weights can be reshaped into
a matrix of l_filter-by-l_enc, where the amino acid encoding length l_enc is 20. Because the filters show relative importance we
rescale all filters such that the height of the highest column is 1. Each filter can then be visualized as a PSSM logo, where
the height of each column can be interpreted as position importance and the height of each letter is amino acid importance.
We use Seq2Logo with the PSSM-logo setting to create the convolution filter logos (Thomsen & Nielsen, 2012).
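The reshaping and rescaling step can be sketched as below. Two points are assumptions for the sake of the example: the flat weight vector is taken to be stored position-major (the actual memory layout depends on the framework), and the height of a logo column is taken to be the sum of absolute weights at that filter position, which the text does not define explicitly.

```python
import numpy as np

def filter_to_pssm(weights, l_filter, l_enc=20):
    """Reshape the weights of one 1D convolutional filter into an
    l_filter-by-l_enc matrix and rescale it so that the tallest logo
    column (one filter position) has height 1."""
    pssm = np.asarray(weights, dtype=float).reshape(l_filter, l_enc)
    heights = np.abs(pssm).sum(axis=1)  # assumed definition of column height
    top = heights.max()
    return pssm / top if top > 0 else pssm
```

Each row of the result corresponds to one filter position and each column to one amino acid, which is the layout a PSSM logo tool would expect as input.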


We visualize the importance the A-LSTM network assigns to each position in the input by plotting α from
equation (12). Lastly we extract and plot the hidden representation from the LSTM networks. For the A-LSTM network we
use c from equation (13) and for the R-LSTM we use the last hidden state, h_T. Both c and h_T can be seen as fixed length
representations of the amino acid sequences. We plot the representation using t-SNE (Van der Maaten & Hinton, 2008).


2.6. EXPERIMENTAL SETUP 

All models were implemented in Theano (Bastien et al., 2012) using a modified version of the Lasagne library and trained
with gradient descent. The learning rate was controlled with ADAM (α = 0.0002, β1 = 0.1, β2 = 0.001, ε = 10^-8 and
λ = 10^-8) (Kingma & Ba, 2014). Initial weights were sampled uniformly from the interval [-0.05, 0.05]. The network
architecture is a 1D convolutional layer followed by an LSTM layer, a fully connected layer and a final softmax layer. All
layers use 50% dropout. The 1D convolutional layer uses convolutions of sizes 1, 3, 5, 9, 15 and 21 with 10 filters of
each size. Dense and convolutional layers use ReLU activation (Nair & Hinton, 2010) and the LSTM layer uses hyperbolic
tangent. For the A-LSTM model the size of the first dimension of W_a was 400. Based on previous experiments we trained
for 100 epochs for all models and used 4/5 of the data for training and the last 1/5 of the data for testing.
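The multi-width convolutional layer can be sketched in NumPy as follows. The filter sizes, the 10 filters per size, the ReLU activation and the uniform initialization interval come from the setup above; zero padding that keeps the output length equal to the input length, and concatenation of the six outputs along the feature axis, are assumptions made for this example.

```python
import numpy as np

FILTER_SIZES = (1, 3, 5, 9, 15, 21)
N_FILTERS = 10

def conv1d_same(X, W, b):
    """1D convolution (cross-correlation, as is usual in deep learning).

    X: (T, n_in); W: (k, n_in, n_out) with odd k; b: (n_out,).
    Zero-pads so the output has shape (T, n_out), then applies ReLU."""
    k = W.shape[0]
    pad = k // 2
    Xp = np.pad(X, ((pad, pad), (0, 0)))
    out = np.stack([
        np.tensordot(Xp[t:t + k], W, axes=([0, 1], [0, 1]))
        for t in range(X.shape[0])
    ])
    return np.maximum(out + b, 0.0)

def init_filters(n_in, rng):
    # One (weights, bias) pair per filter size, weights uniform in [-0.05, 0.05].
    return [(rng.uniform(-0.05, 0.05, size=(k, n_in, N_FILTERS)),
             np.zeros(N_FILTERS)) for k in FILTER_SIZES]

def multi_width_conv(X, filters):
    """Apply every filter bank to X and concatenate along the feature axis."""
    return np.concatenate([conv1d_same(X, W, b) for W, b in filters], axis=1)
```

With six sizes and 10 filters each, every sequence position is described by 60 convolutional features, which would then be passed to the LSTM layer.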


3. RESULTS 

Table 1 shows accuracy for the R-LSTM and A-LSTM models and several other models trained on the same dataset.
Comparing the performance of the R-LSTM, A-LSTM and MultiLoc models, utilizing only the sequence information, the
R-LSTM model (0.879 Acc.) performs better than the A-LSTM model (0.854 Acc.) whereas the MultiLoc model (0.767
Acc.) performs significantly worse. Furthermore, the 10-ensemble R-LSTM model increases the performance to
0.902 Acc. Comparing this performance with the other models, which combine the sequence predictions from the MultiLoc
model with large amounts of metadata for the final predictions, only the SherLoc2 model (0.930 Acc.) performs better than
the R-LSTM ensemble. Figure 5 shows a plot of the attention matrix from the A-LSTM model. Figure 7 shows examples
of the learned convolutional filters. Figure 6 shows the hidden state of the R-LSTM and the A-LSTM model.


4. DISCUSSION AND CONCLUSION 

In this paper we have introduced LSTM networks with convolutions for prediction of subcellular localization. Table 1
shows that the LSTM networks perform much better than other methods that only rely on information from the sequence
(LSTM ensemble 0.902 vs. MultiLoc 0.767). This difference is all the more remarkable given that our method is
biologically naive, only utilizing the sequences and their localization labels, while MultiLoc incorporates specific domain
knowledge such as known motifs and signal anchors. One explanation for the performance difference is that the LSTM
networks are able to look at both global and local sequence features whereas the SVM based models do not model global
ProfilePro: http://download.igb.uci.edu/
blastpgp: http://nebc.nox.ac.uk/bioinformatics/docs/biastpgp.html
Modified Lasagne library: https://github.com/skaae/nntools
Table 1. Comparison of results for LSTM models and MultiLoc1/2.
MultiLoc1/2 accuracies are reprinted from (Goldberg et al., 2012)
and the SherLoc accuracy from (Briesemeister et al., 2009).

Model                              Accuracy

Input: Protein Sequence
R-LSTM                             0.879
A-LSTM                             0.854
R-LSTM ensemble                    0.902
MultiLoc                           0.767

Input: Protein Sequence + Metadata
MultiLoc + PhyloLoc                0.842
MultiLoc + PhyloLoc + GOLoc        0.871
MultiLoc2                          0.887
SherLoc2                           0.930


Table 2. True labels are shown by row and model predictions by column.
E.g. row 4 column 3 means that the actual class was Cytoplasmic
but the model predicted Chloroplast.


Matrix            1    2    3    4    5    6    7    8    9   10   11

ER               26    1    0    0    8    1    0    0    0    3    0
Golgi             1   28    0    0    0    0    0    0    0    1    0
Chloroplast       0    0   82    3    0    0    5    0    0    0    0
Cytoplasmic       0    0    1  266    0    0    3   12    0    0    0
Extracellular     0    0    0    1  166    0    0    0    0    1    0
Lysosomal         0    0    0    0    5   12    0    0    0    3    0
Mitochondrial     0    0    2    5    0    0   94    1    0    0    0
Nuclear           0    0    0   27    1    0    3  ...
... 10 ... 0 0 1 18 2 0 ... 0 0 ... 1
Figure 5. Importance weights assigned to different regions of the proteins when making predictions. The y-axis is the true group and
the x-axis is the sequence position. All proteins shorter than 1000 are zero padded from the middle such that the N and C terminals align.
Figure 6. t-SNE plot of hidden representation for Forward and Backward R-LSTM and A-LSTM. Panels: A-LSTM, R-LSTM Forward,
R-LSTM Backward.


dependencies. The LSTM networks have nearly as good performance as methods that use information obtained from other
sources than the sequence (LSTM ensemble 0.902 vs. SherLoc2 0.930). Incorporating this information into the LSTM
models could further improve their performance. However, it is our opinion that using sequence alone yields
the biologically most relevant prediction, while the incorporation of, e.g., GO terms limits the usability of the prediction
by requiring similar proteins to be already annotated to some degree. Furthermore, as we show below, a sequence-based
method potentially allows for a de novo identification of sequence features essential for biological function.


Figure 5 shows where in the sequence the A-LSTM network assigns importance. Sequences from the compartments ER,
extracellular, lysosomal, and vacuolar all belong to the secretory pathway and contain N-terminal signal peptides, which
are clearly seen as bars close to the left edge of the plot. Some of the ER proteins additionally have bars close to the right
edge of the plot, presumably representing KDEL-type retention signals. Golgi proteins are special in this context, since
they are type II transmembrane proteins with signal anchors, slightly further from the N-terminus than signal peptides
(Höglund et al., 2006). Chloroplast and mitochondrial proteins also have N-terminal sorting signals, and it is apparent
from the plot that chloroplast transit peptides are longer than mitochondrial transit peptides, which in turn are longer than
signal peptides (Emanuelsson et al., 2007). For the plasma membrane category we see that some proteins have signal
peptides, while the model generally focuses on signals, presumably transmembrane helices, scattered across the rest
of the sequence with some overabundance close to the C-terminus. Some of the attention focused near the C-terminus
could also represent signals for glycosylphosphatidylinositol (GPI) anchors (Emanuelsson et al., 2007). Cytoplasmic and
nuclear proteins do not have N-terminal sorting signals, and we see that the attention is scattered over a broader region of
the sequences. However, especially for the cytoplasmic proteins there is some attention focused close to the N-terminus,
presumably in order to check for the absence of signal peptides. Finally, peroxisomal proteins are known to have either
N-terminal or C-terminal sorting signals (PTS1 and PTS2) (Emanuelsson et al., 2007), but these do not seem to have been
picked up by the attention mechanism.


In Figure 7 we investigate what the convolutional filters in the model focus on. Notably, the short filters focus on
amino acids with specific characteristics, such as positively or negatively charged, whereas the longer filters seem to focus
on distributions of amino acids across longer sequences. The arginine-rich motif in Figure 7C could represent part of a
nuclear localization signal (NLS), while the longer motif in Figure 7D could represent the transition from transmembrane
helix (hydrophobic) to cytoplasmic loop (in accordance with the "positive-inside" rule). We believe that the learned filters
can be used to discover new sequence motifs for a large range of protein and genomic features.
Figure 7. Examples of learned filters (panels A to D; x-axis: filter position). Filter A captures proline or tryptophan stretches,
B) and C) are sensitive to positively and negatively charged regions, respectively. Note that for C, negative amino acids seem to
suppress the output. Lastly we show a long filter which captures larger sequence motifs in the proteins.


In Figure 6 we investigate whether the LSTM models are able to extract fixed length representations of variable
length proteins. Using t-SNE we plot the LSTMs' hidden representation of the sequences. It is apparent that proteins
from the same compartment generally group together, while the cytoplasmic and nuclear categories tend to overlap. This
corresponds with the fact that these two categories are relatively often confused, see Table 2. The categories form clusters
which make biological sense; all the proteins with signal peptides (ER, extracellular, lysosomal, and vacuolar) lie close
to each other in t-SNE space in all three plots, while the proteins with other N-terminal sorting signals (chloroplasts
and mitochondria) are close in the R-LSTM plots (but not in the A-LSTM plot). Note that the lysosomal and vacuolar
categories are very close to each other in the plots; this corresponds with the fact that these two compartments are
considered homologous (Höglund et al., 2006).
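The confusion between the cytoplasmic and nuclear categories can be read directly off Table 2. The sketch below computes, for the seven rows of Table 2 that are complete, the fraction of proteins of each true class that were predicted correctly. It assumes that the columns follow the same class order as the rows, which is consistent with the example given in the table caption; the function name is hypothetical.

```python
import numpy as np

LABELS = ["ER", "Golgi", "Chloroplast", "Cytoplasmic",
          "Extracellular", "Lysosomal", "Mitochondrial"]

# The complete rows of Table 2 (true class by row, prediction by column).
ROWS = np.array([
    [26,  1,  0,   0,   8,  1,  0,  0, 0, 3, 0],
    [ 1, 28,  0,   0,   0,  0,  0,  0, 0, 1, 0],
    [ 0,  0, 82,   3,   0,  0,  5,  0, 0, 0, 0],
    [ 0,  0,  1, 266,   0,  0,  3, 12, 0, 0, 0],
    [ 0,  0,  0,   1, 166,  0,  0,  0, 0, 1, 0],
    [ 0,  0,  0,   0,   5, 12,  0,  0, 0, 3, 0],
    [ 0,  0,  2,   5,   0,  0, 94,  1, 0, 0, 0],
])

def per_class_recall(rows):
    """Fraction of each true class predicted correctly, assuming row k
    corresponds to column k (same class order on both axes)."""
    idx = np.arange(rows.shape[0])
    return rows[idx, idx] / rows.sum(axis=1)

recall = per_class_recall(ROWS)
for name, r in zip(LABELS, recall):
    print(f"{name:14s} {r:.3f}")
```

In the cytoplasmic row, 266 of 282 proteins are predicted correctly and 12 are assigned to column 8, which under the assumed ordering is the nuclear class; the partially recoverable nuclear row in turn shows 27 proteins assigned to the cytoplasmic column.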


In summary we have introduced LSTM networks with convolutions for subcellular localization. By visualizing the 
learned filters we have shown that these can be interpreted as motif detectors, and lastly we have shown that the LSTM 
network can represent protein sequences as a fixed length vector in a representation that is biologically interpretable. 